MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Electronic mail has become firmly embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability and its ease of use have all acted as catalysts for this pervasive proliferation. Unfortunately, the same can be said of unsolicited bulk email, or spam. Various methods and enabling architectures are available to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing their respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet-scale, data-intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have proven effective. SVM training is, however, a computationally intensive process. In this dissertation, an M/R-based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond those of the original sequential counterpart.
Effectively exploiting large-scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R-based, heterogeneity-aware task-to-node matching and allocation scheme, is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the-box Hadoop counterpart in a typical Cloud-based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback.
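The abstract above does not describe how gSched matches tasks to nodes, so the following is only a rough illustration of what heterogeneity-aware allocation can look like. It uses a generic greedy earliest-finish-time heuristic, which is an assumption and not gSched's policy; the function `schedule`, the task sizes and the node speeds are all hypothetical.

```python
def schedule(task_sizes, node_speeds):
    """Greedy earliest-finish-time assignment (illustrative, not gSched).

    task_sizes:  {task_name: amount of work}   (hypothetical units)
    node_speeds: {node_name: work per second}  (hypothetical values)
    """
    finish = {node: 0.0 for node in node_speeds}  # time each node becomes free
    plan = {node: [] for node in node_speeds}
    # Place the largest tasks first so small ones can fill the gaps.
    for task, size in sorted(task_sizes.items(), key=lambda kv: -kv[1]):
        # Pick the node on which this task would complete soonest.
        best = min(node_speeds, key=lambda n: finish[n] + size / node_speeds[n])
        finish[best] += size / node_speeds[best]
        plan[best].append(task)
    return plan, max(finish.values())


tasks = {"t1": 8, "t2": 4, "t3": 3, "t4": 2}
nodes = {"fast": 4.0, "slow": 1.0}
plan, makespan = schedule(tasks, nodes)
print(plan, makespan)
```

A speed-unaware allocation would hand the slow node as many tasks as the fast one; weighting each choice by node speed is the basic idea behind heterogeneity-aware scheduling in general.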
An ontology enhanced parallel SVM for scalable spam filter training
This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright © 2013 Elsevier B.V. Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.
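The partition-train-combine idea described in this abstract can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm: it trains each local linear SVM by sub-gradient descent on the hinge loss rather than SMO, omits the bias term, runs the "map" and "reduce" steps in a single process, and does not model the ontology-based augmentation. All function names and the toy data are hypothetical.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(data, epochs=50, eta=0.01, lam=0.01, seed=0):
    """Sub-gradient descent on the L2-regularised hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):    # one shuffled pass
            violated = y * dot(w, x) < 1             # point inside the margin?
            w = [wi * (1 - eta * lam) for wi in w]   # regularisation shrink
            if violated:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def map_partition(partition, tol=0.1):
    """Map: train a local SVM and emit the points on or inside its margin."""
    w = train_linear_svm(partition)
    margins = [y * dot(w, x) for x, y in partition]
    cut = max(1.0 + tol, min(margins))  # always keeps at least one point
    return [p for p, m in zip(partition, margins) if m <= cut]


def reduce_candidates(candidate_lists):
    """Reduce: pool the candidates from all mappers and retrain once."""
    merged = [p for candidates in candidate_lists for p in candidates]
    return train_linear_svm(merged), merged


def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1


def make_toy_data(n_per_class=100, seed=42):
    """Two well-separated 2-D clusters standing in for spam / legitimate mail."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_per_class):
        data.append(((rng.uniform(0.5, 3.0), rng.uniform(0.5, 3.0)), 1))
        data.append(((-rng.uniform(0.5, 3.0), -rng.uniform(0.5, 3.0)), -1))
    rng.shuffle(data)
    return data


data = make_toy_data()
partitions = [data[i::4] for i in range(4)]  # four hypothetical nodes
w, merged = reduce_candidates([map_partition(p) for p in partitions])
accuracy = sum(predict(w, x) == y for x, y in data) / len(data)
print(f"retrained on {len(merged)} of {len(data)} points, accuracy {accuracy:.2f}")
```

Because each mapper forwards only the points near its local margin, the final training step usually sees far fewer examples than the full data set, which is where the time saving comes from. The same filtering is also a plausible source of the accuracy degradation that the abstract says the ontology semantics are meant to offset.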
EUROMOD update : feasibility study : Malta (Tax-Benefit Systems 2007-2010)
The purpose of this study is to examine the technical feasibility of applying a micro-simulation model to analyse the impact of policy on social integration from both the national and the EU perspective. This is the first time that Malta’s tax-benefit system has been analysed in terms of its main elements and the policy rules underlying the entitlement criteria that define them. This was an opportunity for the main players in this field to work in synergy on this vital issue: the Ministry for the Family and Social Solidarity, in charge of social benefits; the Ministry of Finance, responsible for fiscal policy and the income tax system in particular; and the National Statistics Office, tasked with income data collection based on the EU-SILC methodology. This Feasibility Study describes the situation as it was in 2007 and the major changes that took place in 2008, 2009 and 2010.
First, the study describes the main elements of the tax-benefit system, namely income, income tax brackets, capital resources and Social Security contributions. The second section of the study illustrates the main sources of data to be used for modelling purposes and gives examples of the calculation of income tax and social benefits. It has been agreed that the EU-SILC 2008 data would be used for the income element, since Malta joined this system of data collection in 2005.
The third section of the study first outlines the qualities and limitations of the input data set. This section also focuses on the specificities of Malta’s data collection and possible difficulties regarding model application. The study points to the possible combinations of sample and population databases. Simulation possibilities have also been specified for both systems separately. Finally, the non-take-up of benefits and the issue of tax and benefit fraud illustrate the situation and the possible unknown element on both sides.